Every time I start a new data science project, I go through the same setup steps. Create folders, set up the virtual environment, add a .gitignore, write the Dockerfile. It takes time, and worse, when I don't follow a consistent structure I end up with projects that are hard to navigate six months later.
The structure below is what I use. It's not revolutionary, and it's opinionated in ways that work for me, but having a standard starting point saves time and keeps things consistent across projects. Here it is.
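Reconstructed from the pieces described below, the layout looks like this:

```
.
├── data/
├── Dockerfile
├── Makefile
├── .env
├── .envrc
├── .gitignore
├── notebooks/
├── README.md
├── scripts/
├── setup.py
└── thepkg/
    ├── interface/
    ├── ml_logic/
    │   ├── data.py
    │   ├── preprocessor.py
    │   └── model.py
    ├── params.py
    └── utils.py
```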
Here's what each piece does and why I include it:
data: The project's data files. Keep raw data here and don't overwrite it; treat it as read-only once ingested.
Dockerfile: Defines the container environment. I develop inside Docker containers (see my post on Docker + bind mounts), so this is a key part of every project, not optional scaffolding.
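As a sketch of what that Dockerfile might look like (the base image, Python version, and requirements.txt file are illustrative assumptions, not part of the original structure):

```dockerfile
# Illustrative Dockerfile sketch; python:3.11-slim and
# requirements.txt are assumptions, not prescriptions.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so the layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project and install thepkg in editable mode
COPY . .
RUN pip install --no-cache-dir -e .
```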
Makefile: Automates common tasks such as building the image, running tests, and launching the notebook server. A well-written Makefile means you don't have to remember long docker run commands.
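A minimal sketch of such a Makefile; the image name, port, and Jupyter invocation are placeholders of mine, not from the original post:

```makefile
# Illustrative Makefile; IMAGE and the port are placeholders.
IMAGE := myproject

build:
	docker build -t $(IMAGE) .

test:
	docker run --rm -v $(PWD):/app $(IMAGE) pytest

notebook:
	docker run --rm -p 8888:8888 -v $(PWD):/app $(IMAGE) \
		jupyter lab --ip=0.0.0.0 --allow-root

.PHONY: build test notebook
```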
.env: Environment variables such as API keys, database connection strings, and anything else that shouldn't be hardcoded. Never commit this file.
.envrc: Works with direnv to automatically load the environment when you enter the project directory. Makes switching between projects less error-prone.
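A typical .envrc for this setup can be as short as the following, relying on direnv's built-in dotenv function:

```shell
# .envrc: run `direnv allow` once, then direnv loads this
# automatically whenever you cd into the project.
dotenv   # exports the variables defined in .env
```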
.gitignore: Keeps the repository clean. At minimum: .env, __pycache__, .ipynb_checkpoints, and whatever data files are too large to version.
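A starting .gitignore covering those minimums (the data/* rule assumes you version nothing under data/, which you may want to relax for small files):

```
.env
__pycache__/
.ipynb_checkpoints/
data/*
*.egg-info/
```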
notebooks: For exploration and deliverables. One notebook per logical question. Don't use notebooks as a substitute for proper code; anything reusable goes into thepkg.
README.md: How to set up and run the project. Assume the reader has never seen your code before, because in three months that reader will be you.
scripts: Standalone scripts for data ingestion, batch jobs, and one-off transformations. These call into thepkg rather than containing business logic directly.
setup.py: Lets you install thepkg as a local editable package with pip install -e . (the trailing dot means the current directory). Once it's installed, you can import it cleanly from notebooks and scripts without path hacks.
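A minimal setup.py that makes the editable install work might look like this; the version number is a placeholder:

```python
# Minimal setup.py sketch; version is illustrative.
from setuptools import find_packages, setup

setup(
    name="thepkg",
    version="0.1.0",
    packages=find_packages(),  # finds thepkg and its subpackages
)
```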
thepkg: The actual Python package where reusable code lives. Organized into:
interface: Entry points and API definitions.
ml_logic: Data loading (data.py), preprocessing (preprocessor.py), and model code (model.py). Keeping these separate makes it easier to swap out one component without touching the others.
params.py: Configuration constants such as model hyperparameters, file paths, and column names. Centralizing these means you change things in one place.
utils.py: Utility functions that donβt belong anywhere else.
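To make params.py concrete, here is an illustrative sketch; every name and value below is a hypothetical example, not taken from the original post:

```python
# thepkg/params.py (sketch): one central place for configuration.
from pathlib import Path

# Paths (hypothetical file names)
DATA_DIR = Path("data")
RAW_FILE = DATA_DIR / "raw.csv"

# Model settings (example values only)
TARGET_COLUMN = "target"
LEARNING_RATE = 1e-3
RANDOM_SEED = 42
```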
The folder structure is a template, not a contract. Some projects need more; some need less. But starting here is faster than starting from scratch, and it's easier to remove structure you don't need than to add it retroactively.
Below is a bash script to create the entire structure in one shot.
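A sketch of that script, following the layout described above; the __init__.py files are an addition of mine so that thepkg imports as a package:

```shell
#!/usr/bin/env bash
# Create the project skeleton described above.
set -euo pipefail

PROJECT=${1:-myproject}   # project name; "myproject" is a placeholder default

mkdir -p "$PROJECT"/{data,notebooks,scripts}
mkdir -p "$PROJECT"/thepkg/{interface,ml_logic}

touch "$PROJECT"/{Dockerfile,Makefile,.env,.envrc,.gitignore,README.md,setup.py}
touch "$PROJECT"/thepkg/{__init__.py,params.py,utils.py}
touch "$PROJECT"/thepkg/interface/__init__.py
touch "$PROJECT"/thepkg/ml_logic/{__init__.py,data.py,preprocessor.py,model.py}

echo "Created project skeleton in $PROJECT/"
```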